Loading necessary libraries

library(roahd)

Functional data analysis with R: a primer

Functional Data Analysis is one of the core research areas of our department, you have already seen during class that some of the state-of-the-art techniques were actually developed here. Depth-based techniques find a natural application in Functional Statistics. I assume that many of you already know what we mean by functional data (applied statistics course docet), nevertheless do not worry if you do not: I have structured this lab as a very basic primer to understand how to do some methodological and applied work in the framework of functional data analysis (FDA).

According to the FDA model, data can be seen as measurements of a quantity/quantities along a given, independent and continuous indexing variable (such as time or space). Observations are then treated as random functions and can be viewed as trajectories of stochastic processes defined on a given infinite dimensional functional space.

A natural way to reason (and actually comprehend why they need specific and peculiar techniques) is to think about them as relatively dense longitudinal data: unlike tabular data, functional data are not invariant to permutations of their dimensions (i.e., if I switch two columns of a standard multivariate dataset no harm is done, if I do it with a functional one, I will create a bloody mess).

Given its ubiquitous presence in applications FDA has been a very active research field in the past decades, with a plethora of R packages developed for helping practitioner efficiently and effectively employ such techniques. In what follows we will focus our attention on the roahd package (developed by members and former members of our Department of Mathematics): a package meant to collect and provide methods for the analysis of univariate and multivariate functional datasets through the use of robust methods. First off we will dedicate some time in understanding how to represent these complex infinite dimensional objects (our computers only deal with finite approximations) and how to simulate functional data. We will then focus on computation of depths and outlier detection. This lab WILL NOT provide a thorough treatment of FDA, but we will limit to study some introductory concepts on how to deal with univariate (and bivariate, only very briefly) functional datasets.

Simulating functional data with roahd package

The way roahd package represents functional objects is by providing an evenly spaced grid \(I=\left[t_{0}, t_{1}, \ldots, t_{P-1}\right]\) \(\left(t_{j}-t_{j-1}=h>0, \: \forall j=1, \ldots, P-1\right)\) over which the functional observations \(D_{i, j}=X_{i}\left(t_{j}\right)\) \(\forall i=1, \ldots, N\) and \(\forall j=0, \ldots, P-1\) are measured. This is very conveniently handled by the fData object class. In particular, the following model is considered for the generation of data: \[ X(t)=m(t)+\varepsilon(t), \text { for all } t \text { in } I \] where \(m(t)\) is the mean function and \(\varepsilon(t)\) is a centered Gaussian process with covariance function \(C(\cdot,\cdot)\). That is to say: \[ \operatorname{Cov}(\varepsilon(\mathrm{s}), \varepsilon(t))=C(s, t), \text { with } s, t \text { in } I \] The employed structure for \(C(s, t)\) is the Exponential covariance function:

\[ C(s, t)=\alpha e^{-\beta|s-t|} \]

P <- 101
grid <-  seq( 0, 1, length.out =  P)
  
alpha <-  0.2
beta <-  0.2

C_st <- exp_cov_function( grid, alpha, beta )

C_st contains a \(P \times P\) matrix of values.

image( C_st,
       main = 'Exponential covariance function',
       xlab = 'grid', ylab = 'grid')

After having defined the mean function \(m(t)\)

m <- sin(pi*grid)+sin(2*pi*grid)

we are ready to generate functional data as follows

n <- 100
set.seed(26111992)
data <- generate_gauss_fdata(N = n,centerline = m,Cov=C_st)

The output of the previous chunk is actually a \(n \times P\) matrix, where \(P\) is equal to the length of the grid. That is, instead of having a “proper” functional datum, I have its evaluation on a relatively fine grid (this is actually the best we can do). We can plot it with:

matplot(grid,t(data), type="l", col=adjustcolor(col=1,alpha.f = .4))
lines(grid,m, col="blue", lwd=5)

Or even better we exploit the features of the roahd package to construct a functional object:

f_data <- fData(grid,data)
plot(f_data) # what happens if I do plot(data)?
lines(grid,m, col="black", lwd=5)

The first command defines an object of class fdata

class(f_data)
## [1] "fData"

with dedicated methods for plots, the four basic algebraic operations, subsetting and much more. In particular, using objects of class fdata makes the computation of depth measures and related quantities much easier (see next section). For a thorough account on the potential of the roahd package, check this paper.

By changing the hyperparameters \(\alpha\) and \(\beta\) in the Exponential covariance function, we can generate functional datum with different degree of dependence and variability. In details, \(\alpha\) controls the overall level of variability in the signal, while the parameter \(\beta\) affects the autocorrelation length of the signal’s noise, with lower values of \(\beta\) leading to wider correlation lengths and vice-versa

alpha <-  1
beta <-  0.2

C_st <- exp_cov_function( grid, alpha, beta )

data <- generate_gauss_fdata(N = n,centerline = m,Cov=C_st)

f_data <- fData(grid,data)
plot(f_data, main="High overall level of variability")

alpha <-  .1
beta <-  0.0001

C_st <- exp_cov_function( grid, alpha, beta )

data <- generate_gauss_fdata(N = n,centerline = m,Cov=C_st)

f_data <- fData(grid,data)
plot(f_data, main="High smothness") 

alpha <-  .1
beta <-  100

C_st <- exp_cov_function( grid, alpha, beta )

data <- generate_gauss_fdata(N = n,centerline = m,Cov=C_st)

f_data <- fData(grid,data)
plot(f_data, main="Virtually uncorrelated signals") 

Computing depth measures in a FDA framework

Let us now consider the data you have seen in the Case Study: ECG signals. The registered and smoothed signals are contained in the mfD_healthy object.

data("mfD_healthy") # 
univariate_fdata <- mfD_healthy$fDList[[1]] # I consider the first lead only
plot(univariate_fdata)

With roadhd it is very easy to compute Band depths and Modified band depths for a given fdata object:

band_depth <- BD(Data = univariate_fdata)
modified_band_depth <- MBD(Data = univariate_fdata)

We can compute the median curve

median_curve <- median_fData(fData = univariate_fdata, type = "MBD") # still an fData object

Or manually by firstly computing the Modified Band Depth and then identifying the curve with max MBD

median_curve_manual <- univariate_fdata[which.max(modified_band_depth),] # still an fData object
all(median_curve_manual$values==median_curve$values) 
## [1] TRUE
plot(univariate_fdata)
grid_ecg <- seq(median_curve_manual$t0,median_curve_manual$tP,by=median_curve_manual$h)
lines(grid_ecg,median_curve_manual$values)

Epigraph Index (EI) and Hypograph Index (HI) or their corresponding Modified versions (MEI and MHI) for providing down-upward/up-downward order of data can also be computed: see functions EI, HI, MEI, MHI. As an example, let us compute the Spearman’s correlation index between the first and second lead of the ECG signals. Recall that for a bivariate functional dataset \([\mathbf{x} \mathbf{y}]\)

\[ \hat{\rho}_{s}(\mathbf{x}, \mathbf{y})=\hat{\rho}_{p}\left(I L_{n}-\operatorname{grade}(\mathbf{x}), I L_{n}-\operatorname{grade}(\mathbf{y})\right) \] where \(\hat{\rho}_{p}\) is the sample Pearson correlation coefficient and \[ \begin{aligned} &I L_{n}-\operatorname{grade}(\mathbf{x})=\left(I L_{n}-\operatorname{grade}\left(x_{1}\right), I L_{n}-\operatorname{grade}\left(x_{2}\right), \ldots, I L_{n}-\operatorname{grade}\left(x_{n}\right)\right) \\ &I L_{n}-\operatorname{grade}(\mathbf{y})=\left(I L_{n}-\operatorname{grade}\left(y_{1}\right), I L_{n}-\operatorname{grade}\left(y_{2}\right), \ldots, I L_{n}-\operatorname{grade}\left(y_{n}\right)\right) . \end{aligned} \] where the inferior length grade \(I L_{n}-\operatorname{grade}(x_i)\) can be interpreted as the “proportion of time” that the sample \(x_1,\ldots,x_n\) is smaller than \(x_i\), that is it defines the relative position of a curve with respect to the sample.

The Spearman’s correlation index is immediately obtained via the cor_spearman function

bivariate_data <- as.mfData(list(mfD_healthy$fDList[[1]], mfD_healthy$fDList[[2]]))
plot(bivariate_data)

cor_spearman(bivariate_data, ordering='MHI')
## [1] 0.6811366

Or actually we can manually compute it:

MHI_first_lead <- MHI(bivariate_data$fDList[[1]])
MHI_second_lead <- MHI(bivariate_data$fDList[[2]])

cor(MHI_first_lead, MHI_second_lead)
## [1] 0.6811366

Coding tip: never forget the power of functional programming!

do.call(args = lapply(1:2, function(ind)
  MHI(bivariate_data$fDList[[ind]])), what = "cor")
## [1] 0.6811366

Outlier detection with roahd: functional boxplots and outliergrams

We conclude this lab by looking at some graphical tools for identyfing outliers in a functional sample. Consider the following univariate functional data

alpha <-  0.2
beta <-  0.002

C_st <- exp_cov_function( grid, alpha, beta )

data <- generate_gauss_fdata(N = n,centerline = m,Cov=C_st)

f_data <- fData(grid,data)

We consider two cases: first a dataset with some magnitude outliers, obtained inflating the last 10 curves by a number generated from a uniform distribution in \([2,3]\)

set.seed(33) # reproducibility
outlier_share <- .1
n_outliers <-   n*outlier_share
out_highlighter <- rep(c(1,2),c(n-n_outliers,n_outliers))
f_data_temp <- f_data[1:(n*(1-outlier_share)),] # Coding tip: subsetting is mabe possible by the S3 class fdata
mag_temp <- f_data[(n*(1-outlier_share)+1):n,] * runif(10,2,3)

f_data_mag <- append_fData(f_data_temp,mag_temp)
plot(f_data_mag, col=out_highlighter)

And then some shape outliers by shifting the generating mechanism by the quantity shift_q:

shift_q <- .5

mu_warp=mu=sin(pi*grid+shift_q)+sin(2*pi*grid+shift_q)

shape_temp=generate_gauss_fdata(N = n_outliers, mu_warp, Cov=C_st)
shape_temp=fData(grid,shape_temp)
f_data_shape=append_fData(f_data_temp,shape_temp) 
plot(f_data_shape, col=out_highlighter)

Now let us see whether we can employ the two plotting tools we have seen in class, namely functional boxplots and outliegrams to detect magnitude and shape outliers, respectively.

invisible(fbplot(f_data_mag, main="Magnitude outliers"))

invisible(outliergram(f_data_mag))

I used invisible() in the previous chunk to avoid the output of the call to be printed on the console (and in the knitted html file). If I do not use it I obtain the following:

fbplot(f_data_shape, main="Shape outliers")

## $Depth
##   [1] 0.11941794 0.47258526 0.32503850 0.50875688 0.51202120 0.42418442
##   [7] 0.05471947 0.44966697 0.34153615 0.41094309 0.49388739 0.25033103
##  [13] 0.46148415 0.51209321 0.39750175 0.04440644 0.16522252 0.15912591
##  [19] 0.51308931 0.35234923 0.41615962 0.25932393 0.49001100 0.50787679
##  [25] 0.35572957 0.32413041 0.18964096 0.21727573 0.18282428 0.09815182
##  [31] 0.45249925 0.44432643 0.25574757 0.39053305 0.44882288 0.49020702
##  [37] 0.50445645 0.22153215 0.40049405 0.14020402 0.31457346 0.51088509
##  [43] 0.48876288 0.44612261 0.21853985 0.15872187 0.32537054 0.50789279
##  [49] 0.50629263 0.47253325 0.48965097 0.19740574 0.50759276 0.51214521
##  [55] 0.43428943 0.46780878 0.02000000 0.30044804 0.29381538 0.47728973
##  [61] 0.07658566 0.51019702 0.08822682 0.37105511 0.48971897 0.46210421
##  [67] 0.46548055 0.34415642 0.27434543 0.48607861 0.30566457 0.35639364
##  [73] 0.50384438 0.12681068 0.48860286 0.41711571 0.50476048 0.36390239
##  [79] 0.48821482 0.23406941 0.37900390 0.10801280 0.42349635 0.02674867
##  [85] 0.50020802 0.40249825 0.16453445 0.25549155 0.29016702 0.06704070
##  [91] 0.38340834 0.27212521 0.36185019 0.37102310 0.21195920 0.37791579
##  [97] 0.36079408 0.18767277 0.38006801 0.37687169
## 
## $Fvalue
## [1] 1.5
## 
## $ID_outliers
## [1] 57
outliergram(f_data_shape)

## $Fvalue
## [1] 1.5
## 
## $d
##   [1]  1.207844e-03  3.293399e-03  1.788100e-03  2.606875e-03  2.973406e-03
##   [6]  2.353622e-03  6.641456e-04  1.843709e-03  9.445697e-04  1.423826e-03
##  [11]  3.408460e-03  2.778773e-03  3.102488e-03  2.845790e-03  1.662067e-03
##  [16]  3.659772e-05  3.524313e-04  1.440025e-03  1.899754e-03  2.638600e-03
##  [21]  2.334570e-03  1.505972e-03  2.209369e-03  1.716607e-03  6.064369e-04
##  [26]  3.671654e-03  1.726549e-03  1.964791e-03  1.459829e-03  1.160235e-03
##  [31]  2.783249e-03  1.235807e-03  5.213987e-04  2.320628e-03  2.403805e-03
##  [36]  3.966139e-03  1.903874e-03  1.370513e-03  2.375525e-03  1.346630e-03
##  [41]  3.234937e-03  3.455316e-03  3.201904e-03  2.374574e-03  2.412915e-04
##  [46]  3.344097e-04  7.228050e-04  2.146314e-03  3.383229e-03  2.054542e-03
##  [51]  4.073001e-03  1.761364e-04  2.326688e-03  2.406221e-03  2.324312e-03
##  [56]  2.424124e-03 -4.267420e-16  3.873259e-04  5.901580e-04  3.073020e-03
##  [61]  8.517683e-04  3.343186e-03  1.261116e-04  1.014121e-03  1.513894e-03
##  [66]  2.671832e-03  1.909736e-03  8.865837e-04  1.715776e-03  1.881218e-03
##  [71]  1.248838e-03  9.497583e-04  3.332809e-03  1.524509e-04  2.803251e-03
##  [76]  1.553264e-03  3.588953e-03  2.145957e-03  1.743065e-03  4.286369e-04
##  [81]  2.378654e-03  1.434203e-04  1.359542e-03  1.577584e-04  2.259632e-03
##  [86]  1.311814e-03  2.081396e-04  2.044561e-04  1.985545e-03  9.707901e-05
##  [91]  1.198346e-01  1.259825e-01  9.742135e-02  1.002164e-01  1.050986e-01
##  [96]  1.337795e-01  1.378603e-01  1.794469e-02  1.348418e-01  1.147931e-01
## 
## $ID_outliers
##  [1]  91  92  93  94  95  96  97  98  99 100

We appreciate how the shape outliers are almost entirely missed by the functional boxplot, whereas the outliergram effectively uncovers them.

As usual, saving the output of the plots to an object allows to recover the ID of the identified outliers:

out_shape <- outliergram(f_data_shape, display = FALSE)
out_shape$ID_outliers
##  [1]  91  92  93  94  95  96  97  98  99 100

The outliergram is actually based on some simple manipulations of MEI and MBD, that can be hard coded:

MEI_out_shape <- MEI(f_data_shape)
MBD_out_shape <- MBD(f_data_shape)
a_0 <- a_2 <- -2/(f_data_shape$N*(f_data_shape$N-1))
a_1 <- 2*(f_data_shape$N+1)/(f_data_shape$N-1)
d_manual <- a_0+a_1*MEI_out_shape+a_2*f_data_shape$N^2*MEI_out_shape^2-MBD_out_shape
ID_outliers_manual <- which(d_manual>quantile(d_manual,probs = .75)+out_shape$Fvalue*IQR(x = d_manual))

Let us check that our manual computations are correct:

all(dplyr::near(d_manual, out_shape$d))
## [1] TRUE
all(ID_outliers_manual==out_shape$ID_outliers)
## [1] TRUE

One last comment: we have seen in class that the value \(F\) used to build the fences in the functional boxplot can be adjusted. Despite the algorithmic implementation/ mathematical theory behind it being quite tedious, the automated selection of \(F\) via the roahd package is actually pretty easy:

set.seed(22)
fbplot(f_data_mag, main="Magnitude outliers",adjust = list(VERBOSE=FALSE))

## $Depth
##   [1] 0.17158716 0.48351435 0.36043404 0.51197720 0.51346535 0.44345035
##   [7] 0.11140114 0.44828283 0.37454345 0.41143114 0.50079208 0.29340334
##  [13] 0.47406541 0.51069707 0.39913391 0.07915392 0.19287329 0.20865087
##  [19] 0.51244124 0.38464846 0.43680168 0.30039204 0.49752775 0.50558056
##  [25] 0.36416642 0.35985799 0.23734973 0.26231223 0.23089309 0.15360136
##  [31] 0.44980698 0.44241424 0.27598960 0.41755976 0.46359236 0.49864386
##  [37] 0.50922892 0.26684068 0.42570057 0.18997300 0.35127313 0.51342134
##  [43] 0.49773177 0.46192419 0.24127413 0.18600060 0.33606361 0.51169317
##  [49] 0.50508051 0.46992899 0.48732673 0.22400840 0.50558056 0.51399340
##  [55] 0.45174317 0.46573257 0.04924692 0.31222522 0.30632063 0.48883488
##  [61] 0.13333533 0.51051305 0.11827383 0.37881988 0.48634263 0.46022802
##  [67] 0.47826583 0.35446945 0.31427743 0.48341834 0.31714971 0.36421842
##  [73] 0.50850485 0.15533353 0.48600260 0.41882788 0.50191619 0.39486149
##  [79] 0.49565157 0.25567157 0.40775478 0.13715172 0.42486449 0.08306631
##  [85] 0.49748375 0.40545055 0.19164516 0.27580158 0.32959496 0.09931193
##  [91] 0.18090009 0.22688469 0.17876788 0.14172017 0.02582458 0.21552755
##  [97] 0.14330033 0.20568257 0.25305131 0.07321732
## 
## $Fvalue
## [1] 1.545459
## 
## $ID_outliers
## [1]  91  92  93  94  95  96  97  98 100
outliergram(f_data_mag,adjust = list(VERBOSE=FALSE))

## $Fvalue
## [1] 4.867737
## 
## $d
##   [1] 1.269870e-03 2.801349e-03 1.347818e-03 1.719855e-03 1.165661e-03
##   [6] 1.591763e-03 1.409289e-03 1.082484e-03 1.435708e-03 2.820559e-03
##  [11] 2.784199e-03 1.662661e-03 2.588734e-03 3.433730e-03 2.127738e-03
##  [16] 1.071711e-03 5.723741e-04 1.179009e-03 2.221489e-03 2.030381e-03
##  [21] 1.458126e-03 1.613983e-03 2.153483e-03 1.013606e-03 1.881931e-03
##  [26] 2.584179e-03 1.602418e-03 1.770791e-03 1.065136e-03 1.244600e-03
##  [31] 2.973802e-03 1.506765e-03 1.308210e-03 1.632639e-03 1.875118e-03
##  [36] 2.498666e-03 1.236718e-03 1.034480e-03 1.672603e-03 1.234381e-03
##  [41] 1.908270e-03 1.560235e-03 1.704646e-03 1.198496e-03 1.362552e-03
##  [46] 1.834797e-03 1.472266e-03 1.182732e-03 1.693318e-03 1.428222e-03
##  [51] 2.497359e-03 1.323023e-03 1.446560e-03 1.006002e-03 1.528074e-03
##  [56] 1.165146e-03 7.784541e-04 1.437173e-03 1.106012e-03 8.102592e-04
##  [61] 1.303734e-03 1.084425e-03 1.113735e-03 1.118448e-03 1.292248e-03
##  [66] 1.604636e-03 8.816723e-04 9.684137e-04 1.651769e-03 1.155640e-03
##  [71] 2.155622e-03 2.264108e-03 2.493754e-03 1.199526e-03 1.725321e-03
##  [76] 1.407071e-03 3.198815e-03 1.604636e-03 2.090823e-03 1.598576e-03
##  [81] 1.615053e-03 1.123835e-03 1.254502e-03 1.232559e-03 1.278385e-03
##  [86] 6.934159e-04 1.800497e-03 1.496229e-03 1.601269e-03 1.088188e-03
##  [91] 3.157153e-01 2.550829e-01 1.907362e-01 3.600123e-01 9.818804e-05
##  [96] 5.998303e-02 1.709803e-02 2.930078e-01 2.140930e-01 6.755923e-04
## 
## $ID_outliers
## [1] 91 92 93 94 96 97 98 99